Project Overview

  • Goal: Construct a model that accurately predicts whether an individual earns more than $50,000 per year
  • Motivation: Help a non-profit identify donor candidates and gauge how large a donation to request
  • Data Source: 1994 US Census data, via the UCI Machine Learning Repository*

Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.

In [1]:
import numpy as np                                # Package for numerical computing with Python
import pandas as pd                               # Package to work with data in tabular form and the like
from scipy.stats import skew
from time import time                             # Package to work with time values
from IPython.display import display               # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt                   # Package for plotting
import seaborn as sns                             # Package for plotting, prettier than matplotlib
import visuals as vs                              # Adapted from Udacity
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import statsmodels.api as sm
In [2]:
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
In [3]:
data = pd.read_csv("census.csv")

EDA: Data Dictionary

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
In [4]:
data.info(null_counts=True)   # Show information for each factor: NaN counts and data-type of column
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              45222 non-null  int64  
 1   workclass        45222 non-null  object 
 2   education_level  45222 non-null  object 
 3   education-num    45222 non-null  float64
 4   marital-status   45222 non-null  object 
 5   occupation       45222 non-null  object 
 6   relationship     45222 non-null  object 
 7   race             45222 non-null  object 
 8   sex              45222 non-null  object 
 9   capital-gain     45222 non-null  float64
 10  capital-loss     45222 non-null  float64
 11  hours-per-week   45222 non-null  float64
 12  native-country   45222 non-null  object 
 13  income           45222 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
In [5]:
data.describe(include='all').T    # Summarize each factor, transpose the summary (personal preference)
Out[5]:
count unique top freq mean std min 25% 50% 75% max
age 45222 NaN NaN NaN 38.5479 13.2179 17 28 37 47 90
workclass 45222 7 Private 33307 NaN NaN NaN NaN NaN NaN NaN
education_level 45222 16 HS-grad 14783 NaN NaN NaN NaN NaN NaN NaN
education-num 45222 NaN NaN NaN 10.1185 2.55288 1 9 10 13 16
marital-status 45222 7 Married-civ-spouse 21055 NaN NaN NaN NaN NaN NaN NaN
occupation 45222 14 Craft-repair 6020 NaN NaN NaN NaN NaN NaN NaN
relationship 45222 6 Husband 18666 NaN NaN NaN NaN NaN NaN NaN
race 45222 5 White 38903 NaN NaN NaN NaN NaN NaN NaN
sex 45222 2 Male 30527 NaN NaN NaN NaN NaN NaN NaN
capital-gain 45222 NaN NaN NaN 1101.43 7506.43 0 0 0 0 99999
capital-loss 45222 NaN NaN NaN 88.5954 404.956 0 0 0 0 4356
hours-per-week 45222 NaN NaN NaN 40.938 12.0075 1 40 40 45 99
native-country 45222 41 United-States 41292 NaN NaN NaN NaN NaN NaN NaN
income 45222 2 <=50K 34014 NaN NaN NaN NaN NaN NaN NaN
In [6]:
n_records = data.shape[0]                                                   # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0]                     # n of those with income > 50k
n_at_most_50k = data.where(data['income'] == '<=50K').dropna().shape[0]     # .where method requires dropping of na for this
greater_percent = round((n_greater_50k / n_records)*100,2)                  # Show proportion of > 50k to whole data

data_details = {"Number of observations": n_records,
                "Number of people with income > 50k": n_greater_50k,
                "Number of people with income <= 50k": n_at_most_50k,
                "Percent of people with income > 50k": greater_percent}     # Cache values of analysis

for item in data_details:                                                   # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item]))                      # Print the values
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
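As an aside, `value_counts` yields the same tallies and proportions in a single call each; a minimal sketch on a toy Series standing in for the income column:

```python
import pandas as pd

# Toy stand-in for the income column (the real data is not needed to show the idiom)
income = pd.Series(['<=50K', '>50K', '<=50K', '<=50K'])

counts = income.value_counts()                # absolute tallies per class
share = income.value_counts(normalize=True)   # proportions per class

print(counts['<=50K'])                 # 3
print(round(share['>50K'] * 100, 2))   # 25.0
```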
In [7]:
sns.pairplot(data)
plt.show()

Data Engineering/Preprocessing

Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.

Eng: Factor Names

Factor names containing special characters, such as -, can cause issues downstream (e.g. with attribute access or formula interfaces), so cleaning them up front is helpful.

In [8]:
name_changes = {x: x.replace("-", "_") for x in data.columns.tolist() if "-" in x}
data = data.rename(columns=name_changes)
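The same rename can also be done without building a dict, via the vectorized string accessor on the column index; a sketch on a hypothetical frame with the same hyphenated names:

```python
import pandas as pd

# Hypothetical frame sharing a few of the census column names
df = pd.DataFrame(columns=['capital-gain', 'capital-loss', 'age'])
df.columns = df.columns.str.replace('-', '_')   # vectorized rename, no dict needed

print(list(df.columns))   # ['capital_gain', 'capital_loss', 'age']
```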

Eng: Categorical Transformations

Working with categorical variables often involves mapping strings to numbers: 0 or 1 for binary factors, and $\{x_{0}, x_{1}, \ldots, x_{n}\} \mapsto \{0, 1, \ldots, n\}$ for multinomial factors.

These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best), or nominal (i.e. values indicate a state, e.g. blue, green, yellow).
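pandas can encode this distinction directly with `Categorical`; a minimal sketch using the worst/better/best example above (toy values, not the census data):

```python
import pandas as pd

# Nominal: categories carry no order, only identity
colors = pd.Categorical(['blue', 'green', 'yellow'], ordered=False)

# Ordinal: categories carry a ranking, so comparisons are meaningful
ratings = pd.Categorical(['better', 'worst', 'best'],
                         categories=['worst', 'better', 'best'], ordered=True)

print(list(ratings.codes))   # [1, 0, 2] -- integer codes follow the declared order
print(ratings.min())         # 'worst'
```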

In [9]:
nom_vars = ['income', 'workclass', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
ord_vars = ['education_level']
map_dict = {}
for name in nom_vars:
    map_dict[name] = {category: number for number, category in enumerate(data[name].unique())}
# map_dict
ed_lev_cat = {' Doctorate': 0,
              ' Prof-school': 1,
              ' Masters': 2,
              ' Bachelors': 3,
              ' Assoc-voc': 4,
              ' Assoc-acdm': 5,
              ' Some-college': 6,
              ' HS-grad': 7,
              ' 12th': 8,
              ' 11th': 9,
              ' 10th': 10,
              ' 9th': 11,
              ' 7th-8th': 12,
              ' 5th-6th': 13,
              ' 1st-4th': 14,
              ' Preschool': 15}

map_dict['education_level'] = ed_lev_cat
for name in map_dict:
    data['numeric_' + name] = data[name].map(map_dict[name])
In [10]:
for name in map_dict.keys():
    if name != 'native_country':
        message = 'Mapping for variable: numeric_{}'.format(name)
        print("=" * len(message))
        print(message)
        map_df = pd.DataFrame.from_dict(map_dict[name], orient='index').reset_index().rename(columns={'index': 'Factor Value', 0: 'Numerical Value'}).sort_values(by=['Numerical Value'])
        display(map_df)
====================================
Mapping for variable: numeric_income
Factor Value Numerical Value
0 <=50K 0
1 >50K 1
=======================================
Mapping for variable: numeric_workclass
Factor Value Numerical Value
0 State-gov 0
1 Self-emp-not-inc 1
2 Private 2
3 Federal-gov 3
4 Local-gov 4
5 Self-emp-inc 5
6 Without-pay 6
============================================
Mapping for variable: numeric_marital_status
Factor Value Numerical Value
0 Never-married 0
1 Married-civ-spouse 1
2 Divorced 2
3 Married-spouse-absent 3
4 Separated 4
5 Married-AF-spouse 5
6 Widowed 6
========================================
Mapping for variable: numeric_occupation
Factor Value Numerical Value
0 Adm-clerical 0
1 Exec-managerial 1
2 Handlers-cleaners 2
3 Prof-specialty 3
4 Other-service 4
5 Sales 5
6 Transport-moving 6
7 Farming-fishing 7
8 Machine-op-inspct 8
9 Tech-support 9
10 Craft-repair 10
11 Protective-serv 11
12 Armed-Forces 12
13 Priv-house-serv 13
==========================================
Mapping for variable: numeric_relationship
Factor Value Numerical Value
0 Not-in-family 0
1 Husband 1
2 Wife 2
3 Own-child 3
4 Unmarried 4
5 Other-relative 5
==================================
Mapping for variable: numeric_race
Factor Value Numerical Value
0 White 0
1 Black 1
2 Asian-Pac-Islander 2
3 Amer-Indian-Eskimo 3
4 Other 4
=================================
Mapping for variable: numeric_sex
Factor Value Numerical Value
0 Male 0
1 Female 1
=============================================
Mapping for variable: numeric_education_level
Factor Value Numerical Value
0 Doctorate 0
1 Prof-school 1
2 Masters 2
3 Bachelors 3
4 Assoc-voc 4
5 Assoc-acdm 5
6 Some-college 6
7 HS-grad 7
8 12th 8
9 11th 9
10 10th 10
11 9th 11
12 7th-8th 12
13 5th-6th 13
14 1st-4th 14
15 Preschool 15

Eng: Data Separation

For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the features, or independent variables ($X$).

In [11]:
income_raw = data['income']
features_raw = data.drop('income', axis=1)

Skew

The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).

To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x + 1\right)$ (the $+1$ keeps the many zero-valued records defined), can be applied. This transformation compresses the long tail, reducing the variance and pulling the mean toward the center of the distribution.

Why does this matter: The extreme points may affect the performance of the predictive model.

Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.

Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables. Linear regression, for example, places its assumptions on the errors, requiring a zero conditional mean, $E\left(u \mid x\right) = 0$ where $u = Y - \hat{Y}$, and homoskedastic residuals. In this analysis, the dependent variable is categorical (i.e. discrete, non-continuous), so linear regression is not an appropriate model in any case.
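The effect of the transformation can be checked on synthetic data; a sketch using a lognormal sample (heavily right-skewed, standing in for capital_gain/capital_loss):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample; seed fixed for reproducibility
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

transformed = np.log1p(sample)   # ln(x + 1); the +1 keeps zero values defined

print(round(skew(sample), 2), round(skew(transformed), 2))
```

The skewness of the transformed sample is far closer to zero than that of the raw sample, mirroring the reduction seen in the census features below.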

In [12]:
cap_loss = data['capital_loss']
cap_gain = data['capital_gain']
cap_loss_skew, cap_loss_var, cap_loss_mean = skew(cap_loss), np.var(cap_loss), np.mean(cap_loss)
cap_gain_skew, cap_gain_var, cap_gain_mean = skew(cap_gain), np.var(cap_gain), np.mean(cap_gain)
fac_df = pd.DataFrame({'Feature': ['Capital Loss', 'Capital Gain'],
              'Skewness': [cap_loss_skew, cap_gain_skew],
              'Mean': [cap_loss_mean, cap_gain_mean],
              'Variance': [cap_loss_var, cap_gain_var]})
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 1.639858e+05
1 Capital Gain 11.788611 1101.430344 5.634525e+07
In [13]:
fig = make_subplots(rows=2, cols=1)

fig.update_layout(height=800, width=1000,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False
                 )


fig.add_trace(
    go.Histogram(x=data['capital_loss'], nbinsx=25,
    name='Capital-Loss'),
    row=1, col=1

)

fig.add_trace(
    go.Histogram(x=data['capital_gain'], nbinsx=25,
    name='Capital-Gain'),
    row=2, col=1
)

fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=2, col=1)

fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch=dict(
                     tickmode='array',
                     tickvals=[0, 500, 1000, 1500, 2000],
                     ticktext=[0, 500, 1000, 1500, ">2000"]),
                 row=1, col=1)

fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch=dict(
                     tickmode='array',
                     tickvals=[0, 500, 1000, 1500, 2000],
                     ticktext=[0, 500, 1000, 1500, ">2000"]),
                 row=2, col=1)

fig.show()
In [14]:
train = data['capital_loss']
logit = sm.Logit(data['numeric_income'], train)   # Note: sm.Logit does not add an intercept automatically
# fit the model
result = logit.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         numeric_income   No. Observations:                45222
Model:                          Logit   Df Residuals:                    45221
Method:                           MLE   Df Model:                            0
Date:                Wed, 15 Apr 2020   Pseudo R-squ.:                 -0.2376
Time:                        21:43:14   Log-Likelihood:                -31338.
converged:                       True   LL-Null:                       -25322.
Covariance Type:            nonrobust   LLR p-value:                       nan
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
capital_loss  8.534e-05   2.28e-05      3.747      0.000    4.07e-05       0.000
================================================================================
In [17]:
skewed = ['capital_gain', 'capital_loss']
features_log_transformed = pd.DataFrame(data=features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x : np.log(x + 1))
In [18]:
fig = make_subplots(rows=2, cols=1)

fig.update_layout(height=800, width=1000,
                  title_text="Skewed Distributions of Continuous Census Data Features",
                  showlegend=False
                 )

fig.add_trace(
    go.Histogram(x=features_log_transformed['capital_loss'], nbinsx=25,
    name='Log of Capital-Loss'),
    row=1, col=1

)

fig.add_trace(
    go.Histogram(x=features_log_transformed['capital_gain'], nbinsx=25,
    name='Log of Capital-Gain'),
    row=2, col=1
)

fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=2, col=1)
fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch = dict(
                     tickmode = 'array',
                     tickvals = [0, 500, 1000, 1500, 2000],
                     ticktext = [0, 500, 1000, 1500, ">2000"]),
                 row=1, col=1)

fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                 patch = dict(
                     tickmode = 'array',
                     tickvals = [0, 500, 1000, 1500, 2000],
                     ticktext = [0, 500, 1000, 1500, ">2000"]),
                 row=2, col=1)

fig.show()
In [19]:
log_cap_loss_skew = skew(features_log_transformed['capital_loss'])
log_cap_loss_var = round(np.var(features_log_transformed['capital_loss']),5)
log_cap_loss_mean = np.mean(features_log_transformed['capital_loss'])
log_cap_gain_skew = skew(features_log_transformed['capital_gain'])
log_cap_gain_var = round(float(np.var(features_log_transformed['capital_gain'])),5)
log_cap_gain_mean = np.mean(features_log_transformed['capital_gain'])
log_fac_df = pd.DataFrame({'Feature': ['Log Capital Loss', 'Log Capital Gain'],
              'Skewness': [log_cap_loss_skew, log_cap_gain_skew],
              'Mean': [log_cap_loss_mean, log_cap_gain_mean],
              'Variance': [log_cap_loss_var, log_cap_gain_var]})
fac_df = pd.concat([fac_df, log_fac_df], ignore_index=True)   # DataFrame.append was removed in pandas 2.0
fac_df['Variance'] = fac_df['Variance'].apply(lambda x: '%.5f' % x)
display(fac_df)
Feature Skewness Mean Variance
0 Capital Loss 4.516154 88.595418 163985.81018
1 Capital Gain 11.788611 1101.430344 56345246.60482
2 Log Capital Loss 4.271053 0.355489 2.54688
3 Log Capital Gain 3.082284 0.740759 6.08362

The logarithmic transformation reduced the skew and the variance of each factor.

In [24]:
# # Full Page - Code
!jupyter nbconvert Polishing_Donor_Classification.ipynb --output WIP_Class_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none
# # Full Page - No Code
!jupyter nbconvert Polishing_Donor_Classification.ipynb --output WIP_Class_No_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none --TemplateExporter.exclude_input=True
# # Slides - No Code
!jupyter nbconvert --to slides Polishing_Donor_Classification.ipynb --output WIP_Class_Slides --TemplateExporter.exclude_input=True --SlidesExporter.reveal_transition=none --SlidesExporter.reveal_scroll=True
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to html
[NbConvertApp] Writing 5513146 bytes to WIP_Class_Code.html
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to html
[NbConvertApp] Writing 5467127 bytes to WIP_Class_No_Code.html
[NbConvertApp] Converting notebook Polishing_Donor_Classification.ipynb to slides
[NbConvertApp] Writing 5470469 bytes to WIP_Class_Slides.slides.html
In [ ]: